Abstract: Much current research in AI and games is being 
devoted to Monte Carlo search (MCS) algorithms. While the 
quest for a single unified MCS algorithm that would perform 
well on all problems is of major interest for AI, practitioners 
often know in advance the problem they want to solve, and 
spend plenty of time exploiting this knowledge to customize their 
MCS algorithm in a problem-driven way. We propose an MCS 
algorithm discovery scheme to perform this in an automatic 
and reproducible way. We first introduce a grammar over MCS 
algorithms that enables inducing a rich space of candidate 
algorithms. Afterwards, we search in this space for the algorithm 
that performs best on average for a given distribution of training 
problems. We rely on multi-armed bandits to approximately 
solve this optimization problem. The experiments, generated 
on three different domains, show that our approach enables 
discovering algorithms that outperform several well known MCS 
algorithms such as Upper Confidence bounds applied to Trees 
and Nested Monte Carlo search. We also show that the discovered 
algorithms are generally quite robust with respect to changes in 
the distribution over the training problems. 

Index Terms: Monte-Carlo Search, Algorithm Selection, 
Grammar of Algorithms 

I. Introduction 

MONTE CARLO search (MCS) algorithms rely on ran- 
dom simulations to evaluate the quality of states or 
actions in sequential decision making problems. Most of the 
recent progress in MCS algorithms has been obtained by 
integrating smart procedures to select the simulations to be 
performed. This has led to, among other things, the Upper 
Confidence bounds applied to Trees algorithm (UCT, [1]) that 
was popularized thanks to breakthrough results in computer 
Go [2]. This algorithm relies on a game tree to store simulation 
statistics and uses this tree to bias the selection of future 
simulations. While UCT is one way to combine random sim- 
ulations with tree search techniques, many other approaches 
are possible. For example, the Nested Monte Carlo (NMC) 
search algorithm [3], which obtained excellent results in the 
last General Game Playing competition [4], relies on nested 
levels of search and does not require storing a game tree. 

How to best bias the choice of simulations is still an 
active topic in MCS-related research. Both UCT and NMC 
are attempts to provide generic techniques that perform well 
on a wide range of problems and that work with little or no 
prior knowledge. While working on such generic algorithms 
is definitely relevant to AI, MCS algorithms are in practice 
widely used in a totally different scenario, in which a significant 
amount of prior knowledge is available about the game 
or the sequential decision making problem to be solved. 

People applying MCS techniques typically spend plenty 
of time exploiting their knowledge of the target problem 
so as to design more efficient problem-tailored variants of 
MCS. Among the many ways to do this, one very common 
practice is automatic hyper-parameter tuning. By way of 
example, the parameter $C > 0$ of UCT is in nearly all 
applications tuned through a more or less automated trial 
and error procedure. While hyper-parameter tuning is a very 
simple form of problem-driven algorithm selection, most of 
the advanced algorithm selection work is done by humans, 
i.e., by researchers that modify or invent new algorithms to 
take the specificities of their problem into account. 

The comparison and development of new MCS algorithms 
given a target problem is mostly a manual search process 
that takes much human time and is error prone. Thanks to 
modern computing power, automatic discovery is becoming 
a credible approach for partly automating this process. In 
order to investigate this research direction, we focus on the 
simplest case of (fully observable) deterministic single-player 
games. Our contribution is twofold. First, we introduce a 
grammar over the MCS algorithms that enables generating a 
very rich space of MCS algorithms. It also describes several 
well known MCS algorithms, using a particularly compact 
and elegant description. Second, we assume that there is 
access to a distribution over the training problems that reflects 
this prior knowledge, and adopt as a performance criterion 
for an algorithm its mean performance on problems drawn 
from this distribution. We rely on multi-armed bandits for 
identifying in a computationally efficient way the algorithm 
that maximizes this criterion. Our approach is tested on three 
different domains. The results show that it enables discovering 
new variants of MCS that significantly outperform generic 
algorithms such as UCT or NMC on each of these domains. 
We further show the good robustness properties of the discov- 
ered algorithms by slightly changing the characteristics of the 
problem. 

This paper is structured as follows. Section II formalizes 
the class of sequential decision making problems considered 
in this paper and formalizes the corresponding MCS algorithm 
discovery problem. Section III describes our grammar over 
MCS algorithms and describes several well known MCS 
algorithms in terms of this grammar. Section IV formalizes 
the search for a good MCS algorithm as a multi-armed bandit 
problem. We experimentally evaluate our approach on different 
domains in Section V. Finally, we discuss related work in 
Section VI and conclude in Section VII. 

II. Problem statement 

We consider the class of finite horizon fully observable 
deterministic sequential decision making problems. A problem 
$P$ is a triple $(x_1, f, g)$ where $x_1 \in X$ is the initial state, $f$ 
is the transition function, and $g$ is the reward function. The 
dynamics of a problem are described by 

$$x_{t+1} = f(x_t, u_t), \qquad t = 1, 2, \ldots, T, \tag{1}$$

where for all $t$, the state $x_t$ is an element of the state space $X$ 
and the action $u_t$ is an element of the action space $U$. In the 
context of one player games, $x_t$ denotes the current state of the 
game and $U$ is the set of possible moves. We make no assumptions 
on the nature of $X$ but assume that $U$ is finite. We denote by 
$U_x$ the set of actions (moves) which are legal in state $x \in X$. 
We assume that when starting from $x_1$, the system enters a 
final state after $T$ steps and we denote by $\mathcal{F} \subset X$ the set of 
these final states$^2$. A final state $x \in \mathcal{F}$ is associated with a reward 
$g(x) \in \mathbb{R}$ that should be maximized. 

A search algorithm $A(\cdot)$ is a stochastic algorithm that 
explores the possible sequences of actions to approximately 
maximize 

$$A(P = (x_1, f, g)) \simeq \operatorname*{argmax}_{u_1, \ldots, u_T} g(x_{T+1}), \tag{2}$$

subject to $x_{t+1} = f(x_t, u_t)$ and $u_t \in U_{x_t}$. In order to 
fulfill this task, the algorithm is given a finite amount of 
computational time, referred to as the budget. Moreover, to 
facilitate reproducibility, we focus in this paper on a budget 
expressed as the maximum number $B > 0$ of sequences 
$(u_1, \ldots, u_T)$ that can be evaluated, or, equivalently, as the 
number of calls to the reward function $g(\cdot)$. Note, however, 
that it is trivial in our approach to replace this definition by 
other budget measures, such as CPU time. 

We express our knowledge of the problem as a distribution 
over problem instances $\mathcal{D}_P$, from which we can sample any 
number of random instances $P \sim \mathcal{D}_P$. The quality of a search 
algorithm $A^B(\cdot)$ with budget $B$ on these instances is denoted 
by $J_A^B(\mathcal{D}_P)$ and is defined as the expected quality of solutions 
found on problems drawn from $\mathcal{D}_P$: 

$$J_A^B(\mathcal{D}_P) = \mathbb{E}_{P \sim \mathcal{D}_P} \left\{ \mathbb{E}_{x_{T+1} \sim A^B(P)} \left\{ g(x_{T+1}) \right\} \right\}, \tag{3}$$

where $x_{T+1} \sim A^B(P)$ denotes the final states returned by 
algorithm $A$ with budget $B$ on problem $P$. 

Given a class of candidate algorithms $\mathcal{A}$ and given the budget 
$B$, the algorithm discovery problem amounts to selecting 
an algorithm $A^* \in \mathcal{A}$ of maximal quality: 

$$A^* = \operatorname*{argmax}_{A \in \mathcal{A}} J_A^B(\mathcal{D}_P). \tag{4}$$

2 In many problems, the time at which the game enters a final state is not 
fixed, but depends on the actions played so far. It should however be noted 
that it is possible to make these problems fit this fixed finite time formalism 
by postponing artificially the end of the game until T. This can be done, for 
example, by considering that when the game ends before T, a "pseudo final 
state" is reached from which, whatever the actions taken, the game will reach 
the real final state in T. 



The two main contributions of this paper are: (i) a grammar 
that enables inducing a rich space $\mathcal{A}$ of candidate MCS 
algorithms, and (ii) an efficient procedure to approximately 
solve Eq. (4). 

III. A GRAMMAR FOR MONTE-CARLO SEARCH 
ALGORITHMS 

All MCS algorithms share some common underlying gen- 
eral principles: random simulations, look-ahead search, time- 
receding control, and bandit-based selection. The grammar that 
we introduce in this section aims at capturing these principles 
in a pure and atomic way. We first give an overall view of 
our approach, then present in detail the components of our 
grammar, and finally describe previously proposed algorithms 
by using this grammar. 

A. Overall view 

We call search components the elements on which our 
grammar operates. Formally, a search component is a stochastic 
algorithm that, when given a partial sequence of actions 
$(u_1, \ldots, u_{t-1})$, generates one or multiple completions 
$(u_t, \ldots, u_T)$ and evaluates them using the reward function 
$g(\cdot)$. The search components are denoted by $S \in \mathcal{S}$, where 
$\mathcal{S}$ is the space of all possible search components. Let $S$ be a 
particular search component. We define the search algorithm 
$A_S \in \mathcal{A}$ as the algorithm that, given the problem $P$, executes 
$S$ repeatedly with an empty partial sequence of actions $()$, until 
the computational budget is exhausted. The search algorithm 
$A_S$ then returns the sequence of actions $(u_1, \ldots, u_T)$ that led 
to the highest reward $g(\cdot)$. 

In order to generate a rich class of search components (and 
hence a rich class of search algorithms) in an inductive way, 
we rely on search-component generators. Such generators 
are functions $\Psi : \Theta \to \mathcal{S}$ that define a search component 
$S = \Psi(\theta) \in \mathcal{S}$ when given a set of parameters $\theta \in \Theta$. Our 
grammar is composed of five search component generators: 
$\Psi \in \{$simulate, repeat, lookahead, step, select$\}$. Here is a 
short description of the search components they generate: 

• simulate($\pi^{simu}$): Generate one single completion 
$(u_t, \ldots, u_T)$ using the simulation policy $\pi^{simu}$. 

• repeat($N$, $S$): Execute the search component $S$ $N$ times. 

• lookahead($S$): For each legal action $u_t$, move to the successor 
state $f(x_t, u_t)$ and run the sub-search component $S$. 

• step($S$): For each $\tau \in [t, T]$, run the sub-search component 
$S$, select the action $u_\tau$ towards the best currently found final 
state, and move one step ahead. 

• select($\pi^{sel}$, $S$): Use a selection policy recursively to 
select one or multiple actions $u_t, u_{t+1}, \ldots$, run the search 
component $S$, and backpropagate the sub-search result. 

Note that four of our search component generators are 
parametrized by sub-search components. For example, step 
and lookahead are functions $\mathcal{S} \to \mathcal{S}$. These functions can be 
nested recursively to generate more and more evolved search 
components. We construct the space of search algorithms $\mathcal{A}$ 
by performing this in a systematic way, as detailed in Section IV-A. 

B. Search components 

Figure 1 describes our five search component generators. 
Note that we distinguish between search component inputs 
and search component generator parameters. All our search 
components have the same two inputs: the sequence of already 
decided actions $(u_1, \ldots, u_{t-1})$ and the current state $x_t \in X$. 
The parameters differ from one search component generator 
to another. For example, simulate is parametrized by a 
simulation policy $\pi^{simu}$ and repeat is parametrized by the 
number of repetitions $N > 0$ and by a sub-search component. 
We now give a detailed description of these search component 
generators. 

Simulate The simulate generator is parametrized by a policy 
$\pi^{simu} \in \Pi^{simu}$, which is a stochastic mapping from states to 
actions: $u \sim \pi^{simu}(x)$. In order to generate the completion 
$(u_t, \ldots, u_T)$, simulate repeatedly samples actions $u_\tau$ according 
to $\pi^{simu}(x_\tau)$ and performs transitions $x_{\tau+1} = f(x_\tau, u_\tau)$ 
until reaching a final state. A default choice for the simulation 
policy is the uniformly random policy, defined as 

$$\mathbb{E}\{\pi^{random}(x) = u\} = \begin{cases} \dfrac{1}{|U_x|} & \text{if } u \in U_x \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

Once the completion $(u_t, \ldots, u_T)$ is fulfilled, the whole 
sequence $(u_1, \ldots, u_T)$ is yielded. This operation is detailed in 
Figure 2 and proceeds as follows: (i) it computes the reward of 
the final state $x_{T+1}$, (ii) if the reward is larger than the largest 
reward found previously, it replaces the best current solution, 
and (iii) if the budget $B$ is exhausted, it stops the search. 

Since algorithm $A_S$ repeats $S$ until the budget is exhausted, 
the search algorithm $A_{simulate} \in \mathcal{A}$ is Iterative Sampling: it 
samples $B$ random trajectories $(u_1, \ldots, u_T)$, evaluates each of 
the final state rewards $g(x_{T+1})$, and returns the best found final 
state. Note that, in the YIELD procedure, the variables relative 
to the best current solution ($r^*$ and $(u_1^*, \ldots, u_T^*)$) are defined 
locally for each search component, whereas the numCalls 
counter is global to the search algorithm. This means that if 
$S$ is a search component composed of different nested levels 
of search (see the examples below), the best current solution 
is kept in memory at each level of search. 

Repeat Given a positive integer $N > 0$ and a search 
component $S \in \mathcal{S}$, repeat($N$, $S$) is the search component 
that repeats the search component $S$ $N$ times. For example, 
$S$ = repeat(10, simulate($\pi^{simu}$)) is the search component 
that draws 10 random simulations using $\pi^{simu}$. The corresponding 
search algorithm $A_S$ is again iterative sampling, 
since search algorithms repeat their search component until the 
budget is exhausted. In Figure 1, we use the INVOKE operation 
each time a search component calls a sub-search component. 
This operation is detailed in Figure 2 and ensures that no 
sub-search algorithm is called when a final state is reached, i.e., 
when $t = T + 1$. 

Look-ahead For each legal move $u_t \in U_{x_t}$, lookahead($S$) 
computes the successor state $x_{t+1} = f(x_t, u_t)$ and runs the 
sub-search component $S \in \mathcal{S}$ starting from the sequence 
$(u_1, \ldots, u_t)$. For example, lookahead(simulate($\pi^{simu}$)) 
is the search component that, given the partial sequence 
$(u_1, \ldots, u_{t-1})$, generates one random trajectory for 
each legal next action $u_t \in U_{x_t}$. Multiple-step look-ahead 
search strategies are naturally written as nested calls 
to lookahead. As an example, 
lookahead(lookahead(lookahead(simulate($\pi^{simu}$)))) 
is a search component that runs one random trajectory per 
legal combination of the three next actions $(u_t, u_{t+1}, u_{t+2})$. 

Fig. 1. Search component generators 

SIMULATE($(u_1, \ldots, u_{t-1})$, $x_t$) 
    Param: $\pi^{simu} \in \Pi^{simu}$ 
    for $\tau = t$ to $T$ do 
        $u_\tau \sim \pi^{simu}(x_\tau)$ 
        $x_{\tau+1} \leftarrow f(x_\tau, u_\tau)$ 
    end for 
    YIELD($(u_1, \ldots, u_T)$) 

REPEAT($(u_1, \ldots, u_{t-1})$, $x_t$) 
    Param: $N > 0$, $S \in \mathcal{S}$ 
    for $i = 1$ to $N$ do 
        INVOKE($S$, $(u_1, \ldots, u_{t-1})$, $x_t$) 
    end for 

LOOKAHEAD($(u_1, \ldots, u_{t-1})$, $x_t$) 
    Param: $S \in \mathcal{S}$ 
    for $u_t \in U_{x_t}$ do 
        $x_{t+1} \leftarrow f(x_t, u_t)$ 
        INVOKE($S$, $(u_1, \ldots, u_t)$, $x_{t+1}$) 
    end for 

STEP($(u_1, \ldots, u_{t-1})$, $x_t$) 
    Param: $S \in \mathcal{S}$ 
    for $\tau = t$ to $T$ do 
        INVOKE($S$, $(u_1, \ldots, u_{\tau-1})$, $x_\tau$) 
        $u_\tau \leftarrow u_\tau^*$ 
        $x_{\tau+1} \leftarrow f(x_\tau, u_\tau)$ 
    end for 

SELECT($(u_1, \ldots, u_{t-1})$, $x_t$) 
    Param: $\pi^{sel} \in \Pi^{sel}$, $S \in \mathcal{S}$ 
    for $\tau = t$ to $T$ do                                  ▷ Select 
        $u_\tau \sim \pi^{sel}(x_\tau)$ 
        $x_{\tau+1} \leftarrow f(x_\tau, u_\tau)$ 
        if $n(x_{\tau+1}) = 0$ then 
            break 
        end if 
    end for 
    $t_{leaf} \leftarrow \tau$ 
    INVOKE($S$, $(u_1, \ldots, u_{t_{leaf}})$, $x_{t_{leaf}+1}$)       ▷ Sub-search 
    for $\tau = t_{leaf}$ to $1$ do                           ▷ Backpropagate 
        $n(x_{\tau+1}) \leftarrow n(x_{\tau+1}) + 1$ 
        $n(x_\tau, u_\tau) \leftarrow n(x_\tau, u_\tau) + 1$ 
        $s(x_\tau, u_\tau) \leftarrow s(x_\tau, u_\tau) + r^*$ 
    end for 
    $n(x_1) \leftarrow n(x_1) + 1$ 

Fig. 2. Yield and invoke commands 

Require: $g : \mathcal{F} \to \mathbb{R}$, the reward function 
Require: $B > 0$, the computational budget 
Initialize global: numCalls $\leftarrow 0$ 
Initialize local: $r^* \leftarrow -\infty$ 
Initialize local: $(u_1^*, \ldots, u_T^*) \leftarrow \emptyset$ 

procedure YIELD($(u_1, \ldots, u_T)$) 
    $r = g(x_{T+1})$ 
    if $r > r^*$ then 
        $r^* \leftarrow r$ 
        $(u_1^*, \ldots, u_T^*) \leftarrow (u_1, \ldots, u_T)$ 
    end if 
    numCalls $\leftarrow$ numCalls $+ 1$ 
    if numCalls $= B$ then 
        stop search 
    end if 
end procedure 

procedure INVOKE($S \in \mathcal{S}$, $(u_1, \ldots, u_{t-1}) \in U^*$, $x_t \in X$) 
    if $t \le T$ then 
        $S((u_1, \ldots, u_{t-1}), x_t)$ 
    else 
        YIELD($(u_1, \ldots, u_T)$) 
    end if 
end procedure 

Step For each remaining time step $\tau \in [t, T]$, 
step($S$) runs the sub-search component $S$, extracts 
the action $u_\tau$ from $(u_1^*, \ldots, u_T^*)$ (the best currently 
found action sequence, see Figure 2), and performs the 
transition $x_{\tau+1} = f(x_\tau, u_\tau)$. The search component 
generator step enables implementing time receding search 
mechanisms, e.g., step(repeat(100, simulate($\pi^{simu}$))) 
is the search component that selects the actions 
$(u_1, \ldots, u_T)$ one by one, using 100 random trajectories 
to select each action. As a more evolved example, 
step(lookahead(lookahead(repeat(10, simulate($\pi^{simu}$))))) is 
a time receding strategy that performs 10 random simulations 
for each pair of first two actions $(u_t, u_{t+1})$ to decide which action 
$u_t$ to select. 
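
The four tree-free generators described so far can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the names `Context`, `yield_`, `invoke`, and `run` are hypothetical, the state is represented simply by the tuple of actions played so far, every action is assumed legal in every state, and each component returns the best (reward, sequence) pair it found so that step can keep its own local best solution.

```python
import random

class BudgetExhausted(Exception):
    """Raised once the budget B of evaluated sequences is used up."""

class Context:
    def __init__(self, T, actions, reward, budget, seed=0):
        self.T, self.actions, self.reward = T, actions, reward
        self.budget, self.num_calls = budget, 0      # numCalls is global
        self.rng = random.Random(seed)
        self.best = (float("-inf"), None)            # best over the whole search

def yield_(ctx, seq):
    # Evaluate a complete sequence, record it, and count one budget unit.
    r = ctx.reward(seq)
    if r > ctx.best[0]:
        ctx.best = (r, seq)
    ctx.num_calls += 1
    if ctx.num_calls >= ctx.budget:
        raise BudgetExhausted
    return (r, seq)

def invoke(ctx, S, seq):
    # Call the sub-search only if the sequence is not yet complete.
    return S(ctx, seq) if len(seq) < ctx.T else yield_(ctx, seq)

def simulate():
    def S(ctx, seq):
        while len(seq) < ctx.T:                      # uniformly random policy
            seq = seq + (ctx.rng.choice(ctx.actions),)
        return yield_(ctx, seq)
    return S

def repeat(n, sub):
    def S(ctx, seq):
        return max((invoke(ctx, sub, seq) for _ in range(n)),
                   key=lambda rs: rs[0])
    return S

def lookahead(sub):
    def S(ctx, seq):
        return max((invoke(ctx, sub, seq + (u,)) for u in ctx.actions),
                   key=lambda rs: rs[0])
    return S

def step(sub):
    def S(ctx, seq):
        best = (float("-inf"), None)                 # local best (r*, u*)
        while len(seq) < ctx.T:
            best = max(best, invoke(ctx, sub, seq), key=lambda rs: rs[0])
            seq = best[1][:len(seq) + 1]             # play the action u*_tau
        return best
    return S

def run(component, T, actions, reward, budget, seed=0):
    # Search algorithm A_S: repeat S from the empty sequence until the budget ends.
    ctx = Context(T, actions, reward, budget, seed)
    try:
        while True:
            invoke(ctx, component, ())
    except BudgetExhausted:
        pass
    return ctx.best, ctx.num_calls
```

With these definitions, `step(repeat(100, simulate()))` and `step(lookahead(simulate()))` are built exactly as the corresponding expressions of the grammar are written, and `run` plays the role of the search algorithm $A_S$.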

Select This search component generator implements most 
of the behaviour of a Monte Carlo Tree Search (MCTS, 
Q). It relies on a game tree, which is a non-uniform look- 
ahead tree with nodes corresponding to states and edges 
corresponding to transitions. The role of this tree is twofold: 
it stores statistics on the outcomes of sub-searches and it 
is used to bias sub-searches towards promising sequences of 
actions. A search component select($\pi^{sel}$, $S$) proceeds in three 
steps: the selection step relies on the statistics stored in the 
game tree to select a (typically small) sub-sequence of actions 
$(u_t, \ldots, u_{t_{leaf}})$, the sub-search step invokes the sub-search 
component $S \in \mathcal{S}$ starting from $(u_1, \ldots, u_{t_{leaf}})$, and the 
backpropagation step updates the statistics to take into account 
the sub-search result. 

We use the following notation to denote the information 
stored by the look-ahead tree: $n(x, u)$ is the number of times 
the action $u$ was selected in state $x$, $s(x, u)$ is the sum of 
rewards that were obtained when running sub-search after 
having selected action $u$ in state $x$, and $n(x)$ is the number 
of times state $x$ was selected: $n(x) = \sum_{u \in U_x} n(x, u)$. In 
order to quantify the quality of a sub-search, we rely on 
the reward of the best solution that was tried during that 
sub-search: $r^* = \max g(x)$. In the simplest case, when the 
sub-search component is $S$ = simulate($\pi^{simu}$), $r^*$ is the 
reward associated to the final state obtained by making the 
random simulation with policy $\pi^{simu}$, as usual in MCTS. 
In order to select the first actions, selection relies on a 
selection policy $\pi^{sel} \in \Pi^{sel}$, which is a stochastic function 
that, when given all stored information related to state $x$ 
(i.e., $n(x)$, $n(x, u)$, and $s(x, u)$, $\forall u \in U_x$), selects an action 
$u \in U_x$. The selection policy has two contradictory goals 
to pursue: exploration, trying new sequences of actions to 
increase knowledge, and exploitation, using current knowledge 
to bias computational efforts towards promising sequences of 
actions. Such exploration/exploitation dilemmas are usually 
formalized as a multi-armed bandit problem, hence $\pi^{sel}$ is 
typically one of the policies commonly found in the multi-armed 
bandit literature. Probably the best known such policy 
is UCB-1 @: 



$$\pi^{ucb-1}_C(x) = \operatorname*{argmax}_{u \in U_x} \left( \frac{s(x, u)}{n(x, u)} + C \sqrt{\frac{\ln n(x)}{n(x, u)}} \right), \tag{6}$$

where division by zero returns $+\infty$ and where $C > 0$ is a 
hyper-parameter that enables the control of the exploration / 
exploitation tradeoff. 
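
As a minimal sketch of how such a selection policy can be evaluated from the stored statistics, the function below implements the standard UCB-1 index. The function name `ucb1_select` and the dictionary-based storage of $n(x, u)$ and $s(x, u)$ are assumptions made for illustration only.

```python
import math

def ucb1_select(actions, n_x, n_xu, s_xu, C):
    """Return the action with the highest UCB-1 index in one state x.

    actions: legal actions U_x
    n_x:     number of times state x was selected
    n_xu:    dict action -> n(x, u)
    s_xu:    dict action -> s(x, u), the sum of sub-search rewards
    C:       exploration/exploitation coefficient
    """
    def index(u):
        if n_xu.get(u, 0) == 0:
            return math.inf          # division by zero returns +infinity
        mean = s_xu[u] / n_xu[u]
        return mean + C * math.sqrt(math.log(n_x) / n_xu[u])
    return max(actions, key=index)
```

Untried actions receive an infinite index, so each legal action is selected at least once before the statistics start to drive the choice. With $C = 0$ the policy becomes purely greedy with respect to the empirical mean reward.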

C. Description of previously proposed algorithms 

Our grammar enables generating a very large class of 
MCS algorithms, which includes several already proposed 
algorithms. We now give an overview of these algorithms, which 
can be described particularly compactly and elegantly thanks to 
our grammar: 

• The simplest Monte Carlo algorithm is Iterative Sampling. 
This algorithm draws random simulations until 
the computational time has elapsed and returns the best 
solution found [6]: 

$$is = simulate(\pi^{simu}). \tag{7}$$

In general, iterative sampling is used during a certain time 
to decide which action to select (or which move to play) 
at each step of the decision problem. The corresponding 
search component is 

$$is' = step(repeat(N, simulate(\pi^{simu}))), \tag{8}$$

where $N$ is the number of simulations performed for each 
decision step. 

• The Reflexive Monte Carlo search algorithm introduced in 
[7] proposes using a Monte Carlo search of a given level 
to improve the search of the upper level. The proposed 
algorithm can be described as follows: 

$$rmc(N_1, N_2) = step(repeat(N_1, step(repeat(N_2, simulate(\pi^{simu}))))), \tag{9}$$

where $N_1$ and $N_2$ are called the number of meta-games 
and the number of games, respectively. 

• The Nested Monte Carlo (NMC) search algorithm is a 
recursively defined algorithm generalizing the ideas of 
Reflexive Monte Carlo search. NMC can be described in 
a very natural way by our grammar. The basic search level 
$l = 0$ of NMC simply performs a random simulation: 

$$nmc(0) = simulate(\pi^{random}). \tag{10}$$

The level $l > 0$ of NMC relies on level $l - 1$ in the 
following way: 

$$nmc(l) = step(lookahead(nmc(l - 1))). \tag{11}$$

• Single-player MCTS selects actions one after the other. 
In order to select one action, it relies on select combined 
with random simulations. The corresponding search 
component is thus 

$$mcts(\pi^{sel}, \pi^{simu}, N) = step(repeat(N, select(\pi^{sel}, simulate(\pi^{simu})))), \tag{12}$$

where $N$ is the number of iterations allocated to each 
decision step. UCT is one of the best known variants 
of MCTS. It relies on the $\pi^{ucb-1}_C$ selection policy and 
is generally used with a uniformly random simulation 
policy: 

$$uct(C, N) = mcts(\pi^{ucb-1}_C, \pi^{random}, N). \tag{13}$$

• In the spirit of the work on nested Monte Carlo, the 
authors of [8] proposed the Meta MCTS approach, which 
replaces the simulation part of an upper-level MCTS 
algorithm by a whole lower-level MCTS algorithm. While 
they presented this approach in the context of two-player 
games, we can describe its equivalent for one-player 
games with our grammar: 

$$metamcts(\pi^{sel}, \pi^{simu}, N_1, N_2) = step(repeat(N_1, select(\pi^{sel}, mcts(\pi^{sel}, \pi^{simu}, N_2)))), \tag{14}$$

where $N_1$ and $N_2$ are the budgets for the higher-level and 
lower-level MCTS algorithms, respectively. 

In addition to offering a framework for describing these 
already proposed algorithms, our grammar enables generating 
a huge number of new hybrid MCS variants. We give, in the 
next section, a procedure to automatically identify the best 
such variant for a given problem. 

IV. Bandit-based algorithm discovery 

We now move to the problem of solving Eq. (4), i.e., of 
finding, for a given problem, the best algorithm $A$ from 
among a large class $\mathcal{A}$ of algorithms derived with the grammar 
previously defined. Solving this algorithm discovery problem 
exactly is impossible in the general case since the objective 
function involves two infinite expectations: one over the 
problems $P \sim \mathcal{D}_P$ and another over the outcomes of the algorithm. 
In order to approximately solve Eq. (4), we adopt the formalism 
of multi-armed bandits and proceed in two steps: we first 
construct a finite set of candidate algorithms $\mathcal{A}_{D,\Gamma} \subset \mathcal{A}$ 
(Section IV-A), and then treat each of these algorithms as an 
arm and use a multi-armed bandit policy to select how to 
allocate computational time to the performance estimation of 
the different algorithms (Section IV-B). It is worth mentioning 
that this two-step approach follows a general methodology for 
automatic discovery that we already successfully applied to 
multi-armed bandit policy discovery [9], [10], reinforcement 
learning policy discovery [11], and optimal control policy 
discovery [12]. 



A. Construction of the algorithm space 

We measure the complexity of a search component $S \in \mathcal{S}$ 
using its depth, defined as the number of nested search 
components constituting $S$, and denote this quantity by depth($S$). 
For example, depth(simulate($\pi^{simu}$)) is 1, depth(uct) is 4, 
and depth(nmc(3)) is 7. 
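
The depth measure can be checked with a small sketch that represents search components as nested tuples whose last element is the sub-search component, if any. This tuple encoding and the helper names `depth`, `uct`, and `nmc` are assumptions chosen for illustration; the non-component parameters (policies and repetition counts) are kept as plain labels.

```python
SIM = ("simulate", "pi_simu")

def depth(component):
    """Number of nested search components; the sub-component, if any, is last."""
    sub = component[-1]
    return 1 + (depth(sub) if isinstance(sub, tuple) else 0)

def uct(n):
    # uct = step(repeat(N, select(pi_ucb1, simulate(pi_random))))
    return ("step", ("repeat", n, ("select", "pi_ucb1", SIM)))

def nmc(level):
    # nmc(0) = simulate, nmc(l) = step(lookahead(nmc(l - 1)))
    return SIM if level == 0 else ("step", ("lookahead", nmc(level - 1)))
```

Each level of nmc adds two generators (step and lookahead) on top of the previous level, which is why depth(nmc($l$)) equals $2l + 1$.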

Note that simulate, repeat, and select have parameters 
which are not search components: the simulation policy $\pi^{simu}$, 
the number of repetitions $N$, and the selection policy $\pi^{sel}$, 
respectively. In order to generate a finite set of algorithms 
using our grammar, we rely on predefined finite sets of 
possible values for each of these parameters. We denote by 
$\Gamma$ the set of these finite domains. The discrete set $\mathcal{A}_{D,\Gamma}$ is 
constructed by enumerating all possible algorithms up to depth 
$D$ with constants $\Gamma$, and is pruned using the following rules: 

• Canonization of repeat: Both search components 
$S_1$ = step(repeat(2, repeat(5, $S_{sub}$))) and $S_2$ = 
step(repeat(5, repeat(2, $S_{sub}$))) involve running $S_{sub}$ 
10 times at each step. In order to avoid having this 
kind of algorithm duplicated, we collapse nested repeat 
components into single repeat components. With this 
rule, $S_1$ and $S_2$ both reduce to step(repeat(10, $S_{sub}$)). 

• Removal of nested selects: A search component such as 
select($\pi^{sel}$, select($\pi^{sel}$, $S$)) is ill-defined, since the inner 
select will be called with a different initial state $x_t$ each 
time, making it behave randomly. We therefore exclude 
search components involving two directly nested selects. 

• Removal of repeat-as-root: Remember that the MCS algorithm 
$A_S \in \mathcal{A}$ runs $S$ repeatedly until the computational 
budget is exhausted. Due to this repetition, algorithms 
such as $A_{simulate(\pi^{simu})}$ and $A_{repeat(10, simulate(\pi^{simu}))}$ 
are equivalent. To remove these duplicates, we reject all 
search components whose "root" is repeat. 

In the following, $\nu$ denotes the cardinality of the set of 
candidate algorithms: $\mathcal{A}_{D,\Gamma} = \{A_1, \ldots, A_\nu\}$. To illustrate 
the construction of this set, consider a simple case where the 
maximum depth is $D = 3$ and where the constants $\Gamma$ are 
$\pi^{simu} = \pi^{random}$, $N \in \{2, 10\}$, and $\pi^{sel} = \pi^{ucb-1}_C$. The 
corresponding space $\mathcal{A}_{D,\Gamma}$ contains $\nu = 18$ algorithms. These 
algorithms are given in Table I, where we use sim as an 
abbreviation for simulate($\pi^{simu}$). 

TABLE I 
Unique algorithms up to depth 3 

Depth 1-2         Depth 3                       Depth 3 (continued) 
sim               lookahead(repeat(2, sim))     step(repeat(2, sim)) 
lookahead(sim)    lookahead(repeat(10, sim))    step(repeat(10, sim)) 
step(sim)         lookahead(lookahead(sim))     step(lookahead(sim)) 
select(sim)       lookahead(step(sim))          step(step(sim)) 
                  lookahead(select(sim))        step(select(sim)) 
                  select(repeat(2, sim))        select(repeat(10, sim)) 
                  select(lookahead(sim))        select(step(sim)) 
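
The count of 18 algorithms can be reproduced with a short enumeration sketch. The tuple encoding and the function names are hypothetical, and one simplification is assumed: instead of collapsing nested repeats into a single repeat, the sketch simply never nests a repeat directly inside another repeat, which gives the same set for depth 3 because any such component would have repeat as its root.

```python
SIM = ("simulate",)
REPEATS = (2, 10)                      # the finite domain of N in this example

def expand(sub):
    """All components obtained by wrapping `sub` in one more generator."""
    out = [("lookahead", sub), ("step", sub)]
    if sub[0] != "select":             # no two directly nested selects
        out.append(("select", sub))
    if sub[0] != "repeat":             # simplification: no nested repeats
        out += [("repeat", n, sub) for n in REPEATS]
    return out

def enumerate_components(max_depth):
    levels = [[SIM]]                   # levels[d - 1] holds components of depth d
    for _ in range(max_depth - 1):
        levels.append([c for sub in levels[-1] for c in expand(sub)])
    # Removal of repeat-as-root: the search algorithm already repeats S.
    return [c for level in levels for c in level if c[0] != "repeat"]

candidates = enumerate_components(3)
```

Components rooted at repeat are kept in the intermediate levels because they are valid sub-search components (for example inside step), and are only filtered out of the final candidate list.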



B. Bandit-based algorithm discovery 

One simple approach to approximately solve Eq. (4) is to 
estimate the objective function through an empirical mean 
computed using a finite set of training problems $\{P^{(1)}, \ldots, P^{(M)}\}$, 
drawn from $\mathcal{D}_P$: 

$$J_A^B(\mathcal{D}_P) \simeq \frac{1}{M} \sum_{i=1}^{M} g(x_{T+1}) \mid x_{T+1} \sim A^B(P^{(i)}), \tag{15}$$

where $x_{T+1}$ denotes one outcome of algorithm $A$ with budget 
$B$ on problem $P^{(i)}$. To solve Eq. (4), one can then compute 
this approximated objective function for all algorithms $A \in 
\mathcal{A}_{D,\Gamma}$ and simply return the algorithm with the highest score. 
While extremely simple to implement, such an approach often 
requires an excessively large number of samples $M$ to work 
well, since the variance of $g(\cdot)$ may be quite large. 

In order to optimize Eq. (4) in a smarter way, we propose 
to formalize this problem as a multi-armed bandit problem. 
To each algorithm $A_k \in \mathcal{A}_{D,\Gamma}$, we associate an arm. Pulling 
the arm $k$ for the $t_k$-th time involves selecting the problem 
$P^{(t_k)}$ and running the algorithm $A_k$ once on this problem. 
This leads to a reward associated to arm $k$ whose value is the 
reward $g(x_{T+1})$ that comes with the solution $x_{T+1}$ found by 
algorithm $A_k$. The purpose of multi-armed bandit algorithms 
is to process the sequence of observed rewards to select in a 
smart way the next algorithm to be tried, so that when the time 
allocated to algorithm discovery is exhausted, one (or several) 
high-quality algorithm(s) can be identified. How to select arms 
so as to identify the best one in a finite amount of time is 
known as the pure exploration multi-armed bandit problem 
[13]. It has been shown that index based policies based on 
upper confidence bounds such as UCB-1 are also good 
policies for solving pure exploration bandit problems. Our 
optimization procedure thus works by repeatedly playing arms 
according to such a policy. In our experiments, we perform 
a fixed number of such iterations. In practice, this multi-armed 
bandit approach can provide an answer at any time, by 
returning the algorithm $A_k$ with the currently highest empirical 
reward mean. 
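
The procedure described above can be sketched as a plain UCB-1 loop over candidate algorithms. This is an illustrative sketch rather than the authors' code: the name `discover`, the representation of each algorithm as a callable that maps a problem to one observed reward, and the reuse of training problems modulo their number are all assumptions.

```python
import math

def discover(algorithms, problems, num_iterations, C=1.0):
    """Allocate evaluations to candidate algorithms with UCB-1.

    algorithms: list of callables; algorithms[k](problem) runs A_k once
                and returns the reward g(x_{T+1}) of the solution found
    problems:   list of training problems drawn from the distribution
    Returns (index of best arm, empirical means, pull counts).
    """
    counts = [0] * len(algorithms)
    sums = [0.0] * len(algorithms)
    for t in range(1, num_iterations + 1):
        def index(k):
            if counts[k] == 0:
                return math.inf                  # play every arm once first
            bonus = C * math.sqrt(math.log(t) / counts[k])
            return sums[k] / counts[k] + bonus
        k = max(range(len(algorithms)), key=index)
        # The t_k-th pull of arm k uses the t_k-th training problem
        # (reused cyclically here, which is an assumption of this sketch).
        problem = problems[counts[k] % len(problems)]
        sums[k] += algorithms[k](problem)
        counts[k] += 1
    means = [s / c if c else -math.inf for s, c in zip(sums, counts)]
    best = max(range(len(algorithms)), key=lambda k: means[k])
    return best, means, counts
```

Because the loop can be stopped after any iteration and still return the arm with the highest empirical mean, the sketch keeps the anytime property mentioned in the text.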



C. Discussion 

Note that other approaches could be considered for solving 
our algorithm discovery problem. In particular, optimization 
over expression spaces induced by a grammar such as ours 
is often solved using Genetic Programming (GP, [14]). GP 
works by evolving a population of solutions, which, in our 
case, would be MCS algorithms. At each iteration, the current 
population is evaluated, the weakest solutions are removed, 
and the best solutions are used to construct new candidates 
using mutation and cross-over operations. Most existing GP 
algorithms assume that the objective function is (at least 
approximately) deterministic. One major advantage of the 
bandit-based approach is to natively take into account the 
stochasticity of the objective function and its decomposability 
into problems. Thanks to the bandit formulation, badly per- 
forming algorithms are quickly rejected and the computational 
power is more and more focused on the most promising 
algorithms. 

The main strengths of our bandit-based approach are the 
following. First, it is simple to implement and does not require 
entering into the details of complex mutation and cross-over 
operators. Second, it has only one hyper-parameter (the explo- 
ration/exploitation coefficient), for which there exists a robust 
default setting. Finally, since it is based on exhaustive search 
and on multi-armed bandit theory, formal guarantees can easily 
be derived to bound the regret, i.e., the difference between the 
performance of the best algorithm and the performance of the 
algorithm discovered [9], [13], [15]. 

Our approach is restricted to relatively small depths D since 
it relies on exhaustive search. In our case, we believe that many 
interesting MCS algorithms can be described using search 
components with low depth. In our experiments, we used D = 
5, which already provides many original hybrid algorithms 
that deserve further research. If this limit proved too restrictive, 
a natural improvement would be to combine the 
idea of bandits with the ideas of GP. In this spirit, the authors 
of [17] recently proposed a hybrid approach in which the 
selection of the members of a new population is posed as 
a multi-armed bandit problem. This enables combining the 
best of the two approaches: multi-armed bandits enable taking 
natively into account the stochasticity and decomposability 
of the objective function, while GP cross-over and mutation 
operators are used to generate new candidates dynamically in 
a smart way. 

V. Experiments 

We now apply our automatic algorithm discovery approach 
to three different testbeds: Sudoku, Symbolic Regression, 
and Morpion Solitaire. The aim of our experiments is to
show that our approach discovers MCS algorithms that
significantly outperform several generic (problem-independent)
MCS algorithms on the training instances, on new testing
instances, and even on instances drawn from distributions
different from the one used for learning.



We first describe the experimental protocol in Section V-A.
Sections V-B, V-C, and V-D then give the results obtained
on the three domains. Finally, Section V-E gives an overall
discussion of our results.



A. Protocol 

We first describe the two different kinds of algorithms 
that we compare in our experiments: generic algorithms and 
discovered algorithms. 

Generic algorithms. The generic algorithms are Nested
Monte Carlo, Upper Confidence bounds applied to Trees,
Look-ahead Search, and Iterative sampling. The search
components for Nested Monte Carlo (nmc), UCT (uct), and
Iterative sampling (is) have already been defined in Section
III-C. The search component for Look-ahead Search of level
l > 0 is defined by la(l) = step(larec(l)), where

\[
\mathrm{larec}(l) =
\begin{cases}
\mathrm{lookahead}(\mathrm{larec}(l - 1)) & \text{if } l > 0 \\
\mathrm{simulate}(\pi^{\mathrm{random}}) & \text{otherwise.}
\end{cases}
\tag{16}
\]

For both la(·) and nmc(·), we try all values within the range
[1, 5] for the level parameter. Note that la(1) and nmc(1) are
equivalent, since both are defined by the search component
step(lookahead(simulate(π^random))). For uct(·), we try the
following values of C: {0, 0.3, 0.5, 1.0} and set the budget
per step to B/T, where B is the total budget and T is the
horizon of the problem. This leads to the following set of
generic algorithms: {nmc(2), nmc(3), nmc(4), nmc(5), is,
la(1), la(2), la(3), la(4), la(5), uct(0), uct(0.3), uct(0.5),
and uct(1)}. Note that we omit the budget-per-step parameter
in uct for the sake of conciseness.
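
The recursive definition of Look-ahead Search can be unrolled mechanically. The following minimal Python sketch builds the search component of la(l) as a plain string. The function names `larec` and `la` mirror the text, while the string representation and the spelling `pi_random` are assumptions made only for this illustration.

```python
def larec(level: int) -> str:
    """Unroll the recursion: `level` nested lookahead calls around a random simulation."""
    if level < 0:
        raise ValueError("level must be non-negative")
    if level > 0:
        return f"lookahead({larec(level - 1)})"
    return "simulate(pi_random)"


def la(level: int) -> str:
    """Search component of Look-ahead Search of the given level."""
    return f"step({larec(level)})"


if __name__ == "__main__":
    for level in range(1, 4):
        print(level, la(level))
```

For level 1 the sketch yields a single lookahead wrapped in a step, which is the component that the text identifies with nmc(1).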

Discovered algorithms. In order to generate the set of
candidate algorithms, we used the following constants:
repeat can be used with 2, 5, 10, or 100 repetitions; and select
relies on the UCB1 selection policy from Eq. (6) with the
constants {0, 0.3, 0.5, 1.0}. We create a pool of algorithms by
exhaustively generating all possible combinations of the search
components up to depth D = 5. We apply the pruning rules
described in Section IV-A, which results in a set of v = 3,155
candidate MCS algorithms.
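
To give a feel for how quickly the candidate space grows, here is a minimal Python sketch that enumerates search components up to a given depth from the constants listed above. It assumes that simulate counts as depth 1 and that each of the other four components adds one level. The pruning rules of Section IV-A are not reproduced, so the sketch produces more candidates than the pruned set reported in the text. The name `enumerate_components` is hypothetical.

```python
REPEAT_COUNTS = (2, 5, 10, 100)
SELECT_CONSTANTS = (0, 0.3, 0.5, 1.0)


def enumerate_components(max_depth: int) -> list:
    """All search components (as strings) of depth <= max_depth, without pruning."""
    if max_depth < 1:
        return []
    result = ["simulate"]
    # Every component of depth <= max_depth - 1 can be wrapped once more.
    for sub in enumerate_components(max_depth - 1):
        result.append(f"step({sub})")
        result.append(f"lookahead({sub})")
        result.extend(f"repeat({sub}, {n})" for n in REPEAT_COUNTS)
        result.extend(f"select({sub}, {c})" for c in SELECT_CONSTANTS)
    return result


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, len(enumerate_components(depth)))
```

Under these assumptions each extra level multiplies the pool by ten and adds one, so the unpruned pool holds 11,111 components at depth 5. This illustrates why pruning equivalent or redundant combinations matters before running the bandit.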

Algorithm discovery. In order to carry out the algorithm
discovery, we used a UCB policy for 100 × v time steps,
i.e., each candidate algorithm was executed 100 times on
average. As discussed in Section IV-B, each bandit step
involves running one of the candidate algorithms on a problem
P ~ Dp. We refer to Dp as the training distribution in the
following. Once we have played the UCB policy for 100 × v
time steps, we sort the algorithms by their average training
performance and report the ten best algorithms.
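
The discovery loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the index uses a common UCB1-style form (empirical mean plus `c * sqrt(ln t / n)`), which may differ in its constants from Eq. (6); the candidates are hypothetical noisy functions standing in for MCS algorithms; and `discover`, `make_toy_candidate`, and `sample_problem` are names introduced here.

```python
import math
import random


def discover(candidates, sample_problem, plays_per_candidate=100, c=0.5, top_k=10, seed=0):
    """Run a UCB1-style policy over the candidates and rank them by mean reward.

    candidates     -- list of callables (problem, rng) -> reward in [0, 1]
    sample_problem -- callable rng -> problem, standing in for P ~ Dp
    """
    rng = random.Random(seed)
    n = len(candidates)
    counts = [0] * n
    totals = [0.0] * n

    def index(k, t):
        # Empirical mean plus exploration bonus (assumed UCB1-style form).
        return totals[k] / counts[k] + c * math.sqrt(math.log(t) / counts[k])

    for t in range(1, plays_per_candidate * n + 1):
        if t <= n:
            arm = t - 1  # play every candidate once before using the index
        else:
            arm = max(range(n), key=lambda k: index(k, t))
        reward = candidates[arm](sample_problem(rng), rng)
        counts[arm] += 1
        totals[arm] += reward

    # Final ranking uses the average training performance only.
    ranking = sorted(range(n), key=lambda k: totals[k] / counts[k], reverse=True)
    return ranking[:top_k], counts


def make_toy_candidate(mean):
    """Hypothetical stand-in for an MCS algorithm: noisy reward around `mean`."""
    return lambda problem, rng: mean + rng.uniform(-0.1, 0.1)


if __name__ == "__main__":
    toy = [make_toy_candidate(m) for m in (0.2, 0.5, 0.8)]
    best, counts = discover(toy, sample_problem=lambda rng: None)
    print(best, counts)
```

In the toy run the play counts concentrate on the strongest candidate, which is the behaviour the bandit formulation is meant to provide: weak candidates receive only a handful of evaluations.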

Evaluation. Since algorithm discovery is a form of "learning
from examples", care must be taken with overfitting
issues. Indeed, the discovered algorithms may perform well
on the training problems P while performing poorly on other
problems drawn from Dp. Therefore, to evaluate the MCS
algorithms, we used a set of 1,000 testing problems P ~ Dp
which are different from the training problems. We then
evaluate the score of an algorithm as the mean performance
obtained when running it once on each testing problem.



In each domain, we further test the algorithms by
changing the budget B and/or by using a new distribution
D'p that differs from the training distribution Dp. In each
such experiment, we draw 100 problems from D'p and run
the algorithm once on each problem.

In one domain (Morpion Solitaire), we used a particular 
case of our general setting, in which there was a single 
training problem P, i.e., the distribution Dp was degenerate 
and always returned the same P. In this case, we focused our 
analysis on the robustness of the discovered algorithms when 
tested on a new problem P' and/or with a new budget B. 

B. Sudoku 

Sudoku, a Japanese term meaning "singular number", is a 
popular puzzle played around the world. The Sudoku puzzle 
is made of a grid of G² × G² cells, which is structured into
blocks of size G × G. When starting the puzzle, some cells are
already filled in and the objective is to fill in the remaining
cells with the numbers 1 through G² so that

• no row contains two instances of the same number, 

• no column contains two instances of the same number, 

• no block contains two instances of the same number. 

Sudoku is of particular interest in our case because each
Sudoku grid corresponds to a different initial state x_1. Thus, a
good algorithm A(·) is one that intrinsically has the versatility
to face a wide variety of Sudoku grids.

In our implementation, we maintain for each cell the list 
of numbers that could be put in that cell without violating 
any of the three previous rules. If one of these lists becomes 
empty then the grid cannot be solved and we pass to a final 
state (see Footnote 2). Otherwise, we select the subset of cells 
whose number-list has the lowest cardinality, and define one 
action u ∈ U_x per possible number in each of these cells (as
in p|). The reward associated to a final state is its proportion 
of filled cells, hence a reward of 1 is associated to a perfectly 
filled grid. 
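
The action construction described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: it assumes the grid is a list of lists with 0 marking an empty cell, and the function names `candidates`, `actions`, and `reward` are hypothetical. For brevity the candidate lists are recomputed on each call instead of being maintained incrementally.

```python
def candidates(grid, G, row, col):
    """Numbers that can be placed at (row, col) without breaking a rule."""
    size = G * G
    used = set(grid[row])
    used |= {grid[r][col] for r in range(size)}
    top, left = G * (row // G), G * (col // G)
    used |= {grid[r][c] for r in range(top, top + G) for c in range(left, left + G)}
    return [v for v in range(1, size + 1) if v not in used]


def actions(grid, G):
    """Actions (row, col, value) for the empty cells with the fewest candidates.

    Returns None when some empty cell has no candidate (dead end) and an
    empty list when the grid is already full.
    """
    size = G * G
    lists = {
        (r, c): candidates(grid, G, r, c)
        for r in range(size)
        for c in range(size)
        if grid[r][c] == 0
    }
    if not lists:
        return []
    smallest = min(len(values) for values in lists.values())
    if smallest == 0:
        return None
    return [
        (r, c, v)
        for (r, c), values in lists.items()
        if len(values) == smallest
        for v in values
    ]


def reward(grid):
    """Proportion of filled cells; 1.0 for a completely filled grid."""
    cells = [v for row in grid for v in row]
    return sum(v != 0 for v in cells) / len(cells)


if __name__ == "__main__":
    grid = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 0]]
    print(actions(grid, 2), reward(grid))
```

Restricting the actions to the most constrained cells keeps the branching factor small, which matters because every simulation has to walk through the whole grid.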

We sample the initial states x_1 by filling 33% of the cells,
selected at random, as proposed in [3]. We denote by Sudoku(G)
the distribution over Sudoku problems obtained with this
procedure (in the case of G² × G² games). Even though
Sudoku is most usually played with G = 3 [18], we carry out
the algorithm discovery with G = 4 to make the problem more
difficult. Our training distribution was thus Dp = Sudoku(4)
and we used a training budget of B = 1,000. To evaluate the
performance and robustness of the algorithms found, we tested
the MCS algorithms on two distributions: Dp = Sudoku(4)
and D'p = Sudoku(5), using a budget of B = 1,000.

Table II presents the results, where the scores are the
average number of filled cells, which is given by the reward
times the total number of cells G⁴. Note that the results
have been sorted by decreasing testing scores on Dp. In each
column, we underline both the best generic algorithm and
the best discovered algorithm and show in bold all cases in
which a discovered algorithm outperforms all tested generic
algorithms.
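
As a quick worked example of this conversion (the symbol $r$ for the mean reward is introduced here only for illustration, and the two scores are those of the top-ranked algorithm in Table II):

\[
\text{score} = r \cdot G^{4}, \qquad
G = 4 \Rightarrow G^{4} = 256, \qquad
G = 5 \Rightarrow G^{4} = 625,
\]
\[
\frac{197.3}{256} \approx 0.771, \qquad \frac{480.7}{625} \approx 0.769 .
\]

A score of 197.3 on Sudoku(4) and a score of 480.7 on Sudoku(5) thus both correspond to roughly 77% of the cells being filled on average.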

TABLE II

Ranking and Robustness of Algorithms Discovered when Applied to Sudoku

Name      Search Component                                   Rank  Sudoku(4)  Sudoku(5)
Dis#1     select(step(repeat(select(simulate, 1), 5)), 1)    1     197.3      480.7
Dis#8     step(select(repeat(select(simulate, 0.5), 5), 0))  2     197.2      481.2
Dis#2     step(repeat(step(repeat(simulate, 5)), 10))        3     197.2      484.9
Dis#7     lookahead(step(repeat(select(simulate, 0.3), 5)))  4     197.0      487.5
Dis#6     step(step(repeat(select(simulate, 0), 5)))         5     196.8      482.0
Dis#4     step(step(step(select(simulate, 0.5))))            6     196.7      491.3
Dis#10    select(step(repeat(select(simulate, 0.3), 5)       7     196.5      480.3
Dis#9     lookahead(step(step(select(simulate, 1))))         8     196.3      493.6
Dis#5     select(step(repeat(simulate, 5)), 0.5)             9     196.0      478.8
Dis#3     step(select(step(simulate), 1))                    10    195.9      496.5
uct(0.5)  -                                                  11    194.7      481.6
uct(0)    -                                                  12    194.5      485.5
la(1)     -                                                  13    193.7      429.7
nmc(2)    -                                                  14    192.9      423.4
uct(0.3)  -                                                  15    192.8      480.3
nmc(3)    -                                                  16    192.4      428.6
uct(1)    -                                                  17    191.0      482.3
nmc(4)    -                                                  18    190.3      424.5
nmc(5)    -                                                  19    190.0      425.6
la(2)     -                                                  20    173.5      391.7
la(3)     -                                                  21    168.7      389.4
la(4)     -                                                  22    168.5      388.9
la(5)     -                                                  23    168.4      386.4
is        -                                                  24    168.1      389.4

The first result we observe is the overall better performance
of the discovered algorithms over the generic ones. When
testing on the distribution Dp = Sudoku(4), we observe that
all ten discovered algorithms outperform the best generic
algorithm, uct(0.5). This shows that our grammar generated new
algorithms which are specifically relevant to the Sudoku(4)
task. When running the algorithms on the Sudoku(5) games,
we observe that uct(0) becomes the best generic algorithm. In
this case, four of our discovered algorithms still remain better
than uct(0) and all ten discovered algorithms still perform well.
Sudoku(4) and Sudoku(5) are thus sufficiently close problems
to make algorithms discovered for Sudoku(4) relevant to
Sudoku(5).

We observe some frequent patterns in the search compo- 
nents of the discovered algorithms: five search components 
are based on a double nested step and nine out of the ten 
discovered algorithms rely on select. 

C. Real Valued Symbolic Regression 

Symbolic Regression consists in searching in a large space
of symbolic expressions for the one that best fits a given
regression dataset. Usually this problem is treated using
Genetic Programming approaches. In the line of [19], we
here consider MCS techniques as an interesting alternative
to Genetic Programming. In order to apply MCS techniques,
we encode the expressions as sequences of symbols. We
adopt the Reverse Polish Notation (RPN) to avoid the use
of parentheses. As an example, the sequence [a, b, +, c, *]
encodes the expression (a + b) * c. The alphabet of symbols we
used is {x, 1, +, −, *, /, sin, cos, log, exp, stop}. The initial
state x_1 is the empty RPN sequence. Each action u then adds
one of these symbols to the sequence. When computing the set
of valid actions U_x, we reject symbols that lead to invalid RPN
sequences, such as [+, +, +]. A final state is reached either
when the sequence length is equal to a predefined maximum
T or when the symbol stop is played. In our experiments, we
performed the training with a maximal length of T = 11. The
reward associated to a final state is equal to 1 − mae, where
mae is the mean absolute error associated to the expression
built.

TABLE III

Symbolic Regression Testbed: target expressions and domains.

Target Expression f_P(·)            Domain
x³ + x² + x                         [-1, 1]
x⁴ + x³ + x² + x                    [-1, 1]
x⁵ + x⁴ + x³ + x² + x               [-1, 1]
x⁶ + x⁵ + x⁴ + x³ + x² + x          [-1, 1]
sin(x²) cos(x) − 1                  [-1, 1]
sin(x) + sin(x + x²)                [-1, 1]
log(x + 1) + log(x² + 1)            [0, 2]
√x                                  [0, 4]

TABLE IV

Symbolic Regression Robustness Testbed: target expressions
and domains.

Target Expression f_P(·)            Domain
x³ − x² − x                         [-1, 1]
x⁴ − x³ − x² − x                    [-1, 1]
x⁴ + sin(x)                         [-1, 1]
cos(x³) + sin(x + 1)                [-1, 1]
√x + …                              [0, 4]
x⁶ + 1                              [-1, 1]
sin(x³ + x²)                        [-1, 1]
log(x³ + 1) + x                     [0, 2]
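
The RPN encoding, the validity check on actions, and the 1 − mae reward used in this section can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the maximal length T is not enforced, expressions that raise a math error (for example log of a non-positive number) are not handled, and the toy target x² + x is chosen here only to exercise the code.

```python
import math
import operator

BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
UNARY = {"sin": math.sin, "cos": math.cos, "log": math.log, "exp": math.exp}
ARITY = {"x": 0, "1": 0, **{s: 2 for s in BINARY}, **{s: 1 for s in UNARY}}


def stack_size(sequence):
    """Number of operands left on the stack after reading a valid sequence."""
    return sum(1 - ARITY[symbol] for symbol in sequence)


def valid_symbols(sequence):
    """Symbols that keep the RPN sequence valid; `stop` needs exactly one operand."""
    size = stack_size(sequence)
    allowed = [symbol for symbol, arity in ARITY.items() if arity <= size]
    if size == 1:
        allowed.append("stop")
    return allowed


def evaluate(sequence, x):
    """Evaluate a complete RPN sequence (without the stop symbol) at point x."""
    stack = []
    for symbol in sequence:
        if symbol == "x":
            stack.append(x)
        elif symbol == "1":
            stack.append(1.0)
        elif symbol in BINARY:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY[symbol](left, right))
        else:
            stack.append(UNARY[symbol](stack.pop()))
    return stack[0]


def reward(sequence, samples):
    """1 - mae, where mae is the mean absolute error over the (x, y) samples."""
    mae = sum(abs(evaluate(sequence, x) - y) for x, y in samples) / len(samples)
    return 1.0 - mae


if __name__ == "__main__":
    xs = [-1 + 2 * i / 19 for i in range(20)]  # 20 uniformly spaced points in [-1, 1]
    samples = [(x, x * x + x) for x in xs]      # toy target, chosen for illustration
    print(valid_symbols(["x"]))
    print(reward(["x", "1", "+", "x", "*"], samples))
```

The sequence [x, 1, +, x, *] encodes (x + 1) * x, which equals the toy target, so its reward is 1 up to floating-point rounding.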

We used a synthetic benchmark, which is classical in the
field of Genetic Programming [20]. To each problem P of this
benchmark is associated a target expression f_P(·) ∈ ℝ, and
the aim is to re-discover this target expression given a finite
set of samples (x, f_P(x)). Table III lists these target
expressions. In each case, we used 20 samples (x, f_P(x)),
where x was obtained by taking uniformly spaced elements
from the indicated domains. The training distribution Dp was
the uniform distribution over the eight problems given in Table
III.

TABLE V

Ranking and Robustness of the Algorithms Discovered when Applied to Symbolic Regression

Name      Search Component                                   Rank  T = 11  T = 21  T = 11, B = 10⁵  D'p
Dis#1     step(step(lookahead(lookahead(simulate))))         1     0.066   0.080   0.036            0.101
Dis#5     step(repeat(lookahead(lookahead(simulate)), 2))    2     0.069   0.085   0.037            0.106
Dis#2     step(lookahead(lookahead(repeat(simulate, 2))))    3     0.070   0.086   0.038            0.100
Dis#7     step(lookahead(lookahead(select(simulate, 1))))    4     0.070   0.087   0.040            0.103
Dis#6     step(lookahead(lookahead(select(simulate, 0))))    5     0.071   0.088   0.040            0.110
Dis#4     step(lookahead(select(lookahead(simulate), 0)))    5     0.071   0.087   0.039            0.101
Dis#3     step(lookahead(lookahead(simulate)))               5     0.071   0.086   0.056            0.100
la(2)     -                                                  5     0.071   0.086   0.056            0.100
Dis#10    step(lookahead(select(lookahead(simulate), 0.3)))  9     0.072   0.093   0.040            0.108
Dis#8     step(lookahead(repeat(lookahead(simulate), 2)))    9     0.072   0.085   0.040            0.112
la(3)     -                                                  11    0.073   0.094   0.053            0.101
Dis#9     step(repeat(select(lookahead(simulate), 0.3), 5))  12    0.078   0.093   0.048            0.099
nmc(2)    -                                                  13    0.081   0.103   0.054            0.109
nmc(3)    -                                                  14    0.083   0.105   0.053            0.118
la(4)     -                                                  15    0.089   0.115   0.057            0.101
nmc(4)    -                                                  16    0.094   0.106   0.059            0.140
la(1)     -                                                  17    0.098   0.114   0.066            0.120
la(5)     -                                                  18    0.099   0.124   0.058            0.101
is        -                                                  19    0.120   0.142   0.086            0.138
nmc(5)    -                                                  20    0.121   0.122   0.069            0.140
uct(0)    -                                                  21    0.151   0.130   0.124            0.185
uct(1)    -                                                  22    0.154   0.127   0.118            0.160
uct(0.3)  -                                                  23    0.155   0.129   0.135            0.177
uct(0.5)  -                                                  24    0.156   0.127   0.124            0.184

The training budget was B = 10,000. We evaluate the
robustness of the algorithms found in three different ways: by
changing the maximal length T from 11 to 21, by increasing
the budget B from 10,000 to 100,000, and by testing them on
another distribution of problems D'p. The distribution D'p is
the uniform distribution over the eight new problems given in
Table IV.

The results are shown in Table V, where we report directly
the mae scores (lower is better). As in the Sudoku domain,
we observe the generally strong performance of the discovered
algorithms. There are two generic algorithms, la(2) and
la(3), ranked 5th and 11th, respectively, that perform better
than some of the discovered algorithms. We note, however,
that la(2) has been rediscovered (Dis#3) and that most
discovered algorithms exhibit a look-ahead search structure:
step(lookahead(lookahead(·))).

When setting the maximal length to T = 21, the discovered
algorithms still outperform all the generic algorithms except
la(2). When increasing the testing budget to B = 100,000,
la(3) becomes more interesting than la(2) and nine discovered
algorithms out of the ten outperform la(3). These results thus
show that the algorithms discovered by our approach are robust
with respect to both the maximal length T and the budget B.

In our last experiment with the distribution D'p, we again
observe that all ten discovered algorithms behave particularly
well by outperforming all the generic algorithms except la(2)
and la(3). This result is particularly interesting since it shows
that our approach was able to discover algorithms that work
well for symbolic regression in general, not only for some
particular problems.

When looking at the discovered search components, we 
observe that lookahead is intensively used in all good MCS 
algorithms. This is a strong indication that look-ahead search is 
a good strategy for this kind of expression discovery problem.

D. Morpion Solitaire 

The classic game of Morpion Solitaire [21] is a single-player,
pencil-and-paper game, whose world record has been
improved several times over the past few years using MCS
techniques [5], [7], [22]. This game is illustrated in Figure 3.
The initial state x_1 is an empty cross of points drawn on the
intersections of the grid. Each action places a new point at a
grid intersection in such a way that it forms a new line segment
connecting consecutive points that include the new one. New
lines can be drawn horizontally, vertically, and diagonally. The
game is over when no further actions can be taken. The goal
of the game is to maximize the number of lines drawn before
the game ends, hence the reward associated to final states is
this number (see Footnote 3).

There exist two variants of the game: "Disjoint" and 
"Touching". "Touching" allows parallel lines to share an 
endpoint, whereas "Disjoint" does not. Line segments with 
different directions are always permitted to share points. The 
game is NP-hard [23] and presumed to be infinite under certain
configurations. In this paper, we treat the 5D and 5T versions 
of the game, where 5 is the number of consecutive points to 
form a line, D means disjoint, and T means touching. 

3 In practice, we normalize this reward by dividing it by 100 to make it 
approximately fit into the range [0, 1]. Thanks to this normalization, we can 
keep using the same constants for both the UCB policy used in the algorithm 
discovery and the UCB policy used in select.

Fig. 3. A random policy that plays the game Morpion Solitaire 5T: initial grid; after 1 move; after 10 moves; game end.



We performed the algorithm discovery in a "single training 
problem" scenario: the training distribution Dp always returns 
the same problem P, corresponding to the 5T version of the 
game. The initial state of P was the one given in the leftmost 
part of Figure 3. The training budget was set to B = 10,000.
To evaluate the robustness of the algorithms, we, on the one 
hand, evaluated them on the 5D variant of the problem and, 
on the other hand, changed the evaluation budget from 10,000 
to 100,000. The former provides a partial answer to how rule- 
dependent these algorithms are, while the latter gives insight 
into the impact of the budget on the algorithms' ranking. 

The results of our experiments on Morpion Solitaire are
given in Table VI. As before, the grammar was rich enough to
generate several MCS algorithms which outperformed all the
tested generic algorithms, with average scores of up to 91.43. 
Among the generic algorithms, Nested Monte Carlo gave the 
best results (90.68), which is 0.33 below the worst of the ten 
discovered algorithms. 

When moving to the 5D rules, we observe that all ten 
discovered algorithms still outperform the best generic algo- 
rithm. This is particularly impressive, since it is known that 
the structure of good solutions strongly differs between the 5D
and 5T versions of the game [21]. The last column of Table VI
gives the performance of the algorithms with budget B = 10⁵.
We observe that all ten discovered algorithms also outperform 
the best generic algorithm in this case. Furthermore, the 
increase in the budget seems to also increase the gap between 
the discovered and the generic algorithms. 

When looking at the discovered search components, we 
observe the same kinds of patterns as for Sudoku: double 
nested step, and the use of select. 

E. Discussion 

We have seen that on Sudoku and Morpion Solitaire, all ten
discovered algorithms outperform the best generic algorithm,
while on Symbolic Regression most of them match or
outperform the best generic algorithm, la(2).
This demonstrates that our approach is able to generate
new MCS algorithms specifically tailored to the given class
of problems. We have performed a study of the robustness of
these algorithms by either changing the problem distribution
or by varying the budget B, and found that the algorithms
discovered can outperform generic algorithms even on problems
significantly different from those used for the training.

The importance of each component of the grammar depends 
heavily on the problem. For instance, in Symbolic Regres-
sion, nine of the ten best algorithms discovered rely on two nested
lookahead components, whereas in Sudoku and Morpion,
step and select appear in the majority of the best algorithms 
discovered. 

VI. Related Work 

Methods for automatically discovering MCS algorithms can 
be characterized through three main components: the space of 
candidate algorithms, the performance criterion, and the search 
method for finding the best element in the space of candidate 
algorithms. 

Usually, researchers consider spaces of candidate algorithms 
that only differ in the values of their constants. In such a 
context, the problem amounts to tuning the constants of a 
generic MCS algorithm. Most of the research related to the 
tuning of these constants takes as performance criterion the 
mean score of the algorithm over the distribution of the prob- 
lem instances. Many search algorithms have been proposed 
for computing the best constants. For instance, [24] employs
a grid search approach combined with self-play, [25] uses
cross-entropy as a search method to tune an agent playing Go,
[26] presents a generic black-box optimization method based
on local quadratic regression, [27] uses Estimation of Distribution
Algorithms with Gaussian distributions, [28] uses Thompson
Sampling, and [29] uses, as in the present paper, a multi-armed
bandit approach. The paper [30] studies the influence of the
tuning of MCS algorithms on their asymptotic consistency and 
shows that pathological behaviour may occur with tuning. It 
also proposes a tuning method to avoid such behaviour. 

Research papers that have reported empirical evaluations
of several MCS algorithms in order to find the best one are
also related to this automatic discovery problem. The space
of candidate algorithms in such cases is the set of algorithms
they compare, and the search method is an exhaustive search
procedure. As a few examples, [24] reports on a comparison
between algorithms that differ in their selection policy, [31]
and [32] compare improvements of the UCT algorithm (RAVE
and progressive bias) with the original one on the game of Go,
and [33] evaluates different versions of a two-player MCS
algorithm on generic sparse bandit problems. [34] provides
an in-depth review of different MCS algorithms and their
successes in different applications.

TABLE VI

Ranking and Robustness of Algorithms Discovered when Applied to Morpion

Name      Search Component                                Rank  5T     5D     5T, B = 10⁵
Dis#1     step(select(step(simulate), 0.5))               1     91.43  63.57  97.28
Dis#7     lookahead(step(step(select(simulate, 0))))      2     91.30  63.66  96.28
Dis#4     step(select(step(select(simulate, 0.5)), 0))    3     91.28  63.61  96.23
Dis#3     step(select(step(select(simulate, 1.0)), 0))    4     91.27  63.61  96.13
Dis#8     step(select(step(step(simulate)), 1))           5     91.19  63.68  96.51
Dis#2     step(step(select(simulate, 0)))                 6     91.18  63.63  96.62
Dis#9     step(select(step(select(simulate, 0)), 0.3))    7     91.17  63.67  95.99
Dis#5     select(step(select(step(simulate), 1.0)), 0)    8     91.16  63.74  95.86
Dis#10    step(select(step(select(simulate, 1.0)), 0.0))  9     91.12  63.61  95.98
Dis#6     lookahead(step(step(simulate)))                 10    91.01  63.84  96.38
nmc(4)    -                                               11    90.68  63.36  95.43
nmc(3)    -                                               12    90.67  63.51  95.61
la(1)     -                                               13    90.63  63.43  95.31
nmc(2)    -                                               14    90.54  63.52  95.50
nmc(5)    -                                               15    90.44  63.46  95.31
uct(0)    -                                               16    90.35  63.02  92.79
uct(0.5)  -                                               17    90.25  62.91  92.33
uct(1)    -                                               18    90.22  63.12  92.80
uct(0.3)  -                                               19    90.01  63.03  92.54
la(2)     -                                               20    88.91  62.67  95.76
la(3)     -                                               21    85.87  61.52  89.35
is        -                                               21    85.87  61.40  89.15
la(4)     -                                               23    85.80  61.53  88.87
mcts      -                                               24    85.79  61.48  89.15
la(5)     -                                               25    85.78  61.52  88.87

The main feature of the approach proposed in the present
paper is that it builds the space of candidate algorithms by
using a rich grammar over the search components. In this
sense, [35], [36] are certainly the papers which are the closest
to ours, since they also use a grammar to define a search space,
for, respectively, two-player games and multi-armed bandit
problems. However, in both cases, this grammar only models
a selection policy and is made of classic functions such as +,
−, *, /, log, and exp. We have taken one step forward,
by directly defining a grammar over the MCS algorithms that
covers very different MCS techniques. Note that the search
technique of [35] is based on genetic programming.

The decision as to what to use as the performance crite- 
rion is not as trivial as it looks, especially for multi-player 
games, where opponent modelling is crucial for improving 
over game-theoretically optimal play [37]. For example, the
maximization of the victory rate or loss minimization against
a wide variety of opponents for a specific game can lead to
very different choices of algorithms. Other examples of criteria
to discriminate between algorithms are simple regret [29] and
the expected performance over a distribution density [38].

VII. Conclusion 

In this paper we have addressed the problem of automati- 
cally identifying new Monte Carlo search (MCS) algorithms 
performing well on a distribution of training problems. To 
do so, we introduced a grammar over the MCS algorithms 
that generates a very rich space of candidate algorithms (and 
which describes, along the way, in a particularly compact
and elegant manner, several well-known MCS algorithms).
To efficiently search inside this space of candidate algorithms 
for the one(s) having the best average performance on the 
training problems, we relied on a multi-armed bandit type of 
optimisation algorithm. 

Our approach was tested on three different domains: Su- 
doku, Morpion Solitaire, and Symbolic Regression. The results 
showed that the algorithms discovered this way can signifi- 
cantly outperform generic algorithms such as UCT or NMC 
on each of these domains. Moreover, we showed that they 
had good robustness properties, by changing the testing budget 
and/or by using a testing problem distribution different from 
the training distribution. 

This work can be extended in several ways. For the time be- 
ing, we used the mean performance over a set of training prob- 
lems to discriminate between different candidate algorithms. 
One direction for future work would be to adapt our general 
approach to use other criteria, e.g., worst case performance 
measures. In its current form, our grammar only allows using 
predefined simulation policies. Since the simulation policy 
typically has a major impact on the performance of a MCS 
algorithm, it could be interesting to extend our grammar so that 
it could also "generate" new simulation policies. This could be 
arranged by adding a set of simulation policy generators in the 
spirit of our current search component generators. Of course, 
working with richer grammars will lead to larger candidate 
algorithm spaces, which in turn, may require developing more 
efficient search methods than the multi-armed bandit one used 
in this paper. Finally, another important direction for future 
research is to extend our approach to more general settings 
than single-player games with full observability.